I would like to predict if high scoring determines if you are on a good team or not. A lot of times there are players who score a lot but aren’t on a good team. For many people including fans and scouts it will determine a lot about the player if he is not a good team or not. If that player is on a good team chances are that player has a great upside and potential. I will determine if high scoring means your are on a good team or not. I will do this through my second data source that stores the top 25 ranked teams in the nation and will use data from my first deliverable.
include <- function(library_name){
if( !(library_name %in% installed.packages()) )
install.packages(library_name)
library(library_name, character.only=TRUE)
}
include("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
include("knitr")
include("caret")
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
include("rvest")
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
include("dplyr")
purl("deliverable1.Rmd", output="part1.r")
##
##
## processing file: deliverable1.Rmd
##
|
| | 0%
|
|......... | 14%
|
|................... | 29%
|
|............................ | 43%
|
|..................................... | 57%
|
|.............................................. | 71%
|
|........................................................ | 86%
|
|.................................................................| 100%
## output file: part1.r
## [1] "part1.r"
source("part1.r")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Player = col_character(),
## School = col_character(),
## Pos = col_character(),
## labels = col_character(),
## Status = col_character()
## )
## See spec(...) for full column specifications.
NCAAData <- as.data.frame(NCAAData)
Here is my second data source which includes the ranked teams. I obtained this data source via web scraping.
rankings_web <- read_html("https://www.ncaa.com/rankings/basketball-men/d1/associated-press")
rank <- rankings_web %>%
html_nodes("tbody") %>%
html_nodes("tr")
Rank <- rank %>%
html_nodes("td:first_child") %>%
html_text() %>%
as.integer()
Team <- rank %>%
html_nodes("td:nth_child(2)") %>%
html_text()
Points <- rank %>%
html_nodes("td:nth_child(3)") %>%
html_text()
Record <- rank %>%
html_nodes("td:nth_child(4)") %>%
html_text()
Rankings <- cbind.data.frame(Rank=Rank, Team=Team, Points=Points, Record=Record)
#tidying data
Rankings$Team <- gsub(" \\(48\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(3\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(9\\)", "", Rankings$Team)
Rankings$Team <- gsub(" \\(5\\)", "", Rankings$Team)
colnames(Rankings)[colnames(Rankings)=="Team"] <- "School"
colnames(Rankings)[colnames(Rankings)=="Points"] <- "Team_Points"
colnames(Rankings)[colnames(Rankings)=="Rank"] <- "Team_Rank"
colnames(Rankings)[colnames(Rankings)=="Record"] <- "Team_Record"
Rankings$Team_Points <- gsub(",", "", Rankings$Team_Points)
Rankings$Team_Points <- as.double(Rankings$Team_Points)
NCAAData <- merge(x=NCAAData, y=Rankings, by="School", all.x=TRUE)
Printing out NCAAData Table
(NCAAData)
Printing out Rankings Table
(Rankings)
To get a good idea here is a table of the Ranked teams and their total scoring I used from my second dataset.
ggplot(data=Rankings, aes(x=Points)) +
geom_jitter(aes(y=School, color=School))+
labs(title="Top 25 Team Ranked Scoring", x="", y="") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here I tried to use Points and other columns as an indicator to predict if high scoring determines how good the team is but I did not yield good results. However Rebounds gave me a good indicator to predict that good teams depend on rebounding. In order to use points as a good predictor I would need more paramaters or data.
set.seed(385)
top25 <- filter(NCAAData, !is.na(NCAAData$Team_Points))
sample_selection <- top25$Team_Rank %>%
createDataPartition(p=0.75, list=FALSE)
train <- top25[sample_selection, ]
test <- top25[-sample_selection, ]
train_model <- lm(Team_Rank ~ Minutes_Played + Points + Total_Rebounds + Assists + Steals + Field_Goal_Average, data=top25)
summary(train_model)
##
## Call:
## lm(formula = Team_Rank ~ Minutes_Played + Points + Total_Rebounds +
## Assists + Steals + Field_Goal_Average, data = top25)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.210 -6.784 0.245 6.368 13.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3392013 1.3291293 12.293 <2e-16 ***
## Minutes_Played 0.0001129 0.0002674 0.422 0.6730
## Points 0.0910102 0.0775013 1.174 0.2406
## Total_Rebounds -0.1663744 0.0713556 -2.332 0.0199 *
## Assists -0.1323454 0.1343128 -0.985 0.3247
## Steals -0.2625879 0.3192057 -0.823 0.4109
## Field_Goal_Average -0.1661954 0.1051725 -1.580 0.1144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.406 on 874 degrees of freedom
## Multiple R-squared: 0.008833, Adjusted R-squared: 0.002028
## F-statistic: 1.298 on 6 and 874 DF, p-value: 0.2553
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.01070385
Here I found that that Win Shares per 40 minutes (the entire game time + if they win) is a good indicator to predict that good teams have decent win shares.
train_model <- lm(Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes, data=top25)
summary(train_model)
##
## Call:
## lm(formula = Team_Rank ~ Total_Rebounds + Win_Shares_per40_Minutes,
## data = top25)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1135 -6.8410 0.2134 6.3238 13.4419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.41018 0.67263 21.424 <2e-16 ***
## Total_Rebounds -0.05607 0.06274 -0.894 0.3718
## Win_Shares_per40_Minutes -7.19916 3.88431 -1.853 0.0642 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.396 on 878 degrees of freedom
## Multiple R-squared: 0.006963, Adjusted R-squared: 0.004701
## F-statistic: 3.078 on 2 and 878 DF, p-value: 0.04654
prediction <- train_model %>% predict(test)
R2(prediction, test$Team_Rank)
## [1] 0.02060821